{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model selection using lmfit and emcee "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`lmfit.emcee` can be used to obtain the posterior probability distribution of parameters, given a set of experimental data. This notebook shows how it can be used for Bayesian model selection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import lmfit\n",
    "import matplotlib.pyplot as plt\n",
    "import corner\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# calculate a Gaussian\n",
    "def gauss(x, a_max, loc, sd):\n",
    "    return a_max * np.exp(-((x - loc) / sd)**2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate some data\n",
    "x = np.linspace(3, 7, 250)\n",
    "np.random.seed(0)\n",
    "y = 4 + 10 * x + gauss(x, 200, 5, 0.5) + gauss(x, 60, 5.8, 0.2)\n",
    "dy = np.sqrt(y)\n",
    "y += dy * np.random.randn(np.size(y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.errorbar(x, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# the normalised residual for the data\n",
    "def residual(p, just_generative=False):\n",
    "    v = p.valuesdict()\n",
    "    generative = v['a'] + v['b'] * x\n",
    "    M = 0\n",
    "    while 'a_max%d' % M in v:\n",
    "        generative += gauss(x, v['a_max%d'%M], v['loc%d'%M], v['sd%d'%M])\n",
    "        M += 1\n",
    "\n",
    "    if just_generative:\n",
    "        return generative\n",
    "    return (generative - y) / dy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a Parameter set for the initial guesses\n",
    "def initial_peak_params(M):\n",
    "    p = lmfit.Parameters()\n",
    "    \n",
    "    # a and b give a linear background\n",
    "    a = np.mean(y)\n",
    "    b = 1\n",
    "    \n",
    "    # a_max, loc and sd are the amplitude, location and SD of each gaussian component\n",
    "    a_max = np.max(y)\n",
    "    loc = np.mean(x)\n",
    "    sd = (np.max(x) - np.min(x)) * 0.5\n",
    "\n",
    "    p.add_many(('a', np.mean(y), True, 0, 10), ('b', b, True, 1, 15))\n",
    "\n",
    "    for i in range(M):\n",
    "        p.add_many(('a_max%d'%i, 0.5 * a_max, True, 10, a_max),\n",
    "                   ('loc%d'%i, loc, True, np.min(x), np.max(x)),\n",
    "                   ('sd%d'%i, sd, True, 0.1, np.max(x) - np.min(x)))\n",
    "    return p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Solving with `minimize` gives the Maximum Likelihood solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 = initial_peak_params(1)\n",
    "mi1 = lmfit.minimize(residual, p1, method='differential_evolution')\n",
    "\n",
    "lmfit.printfuncs.report_fit(mi1.params, min_correl=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From inspection of the data above we can tell that there is going to be more than 1 Gaussian component, but how many are there? A Bayesian approach can be used for this model selection problem. We can do this with `lmfit.emcee`, which uses the `emcee` package to do a Markov Chain Monte Carlo sampling of the posterior probability distribution. `lmfit.emcee` requires a function that returns the log-posterior probability. The log-posterior probability is a sum of the log-prior probability and log-likelihood functions. \n",
    "\n",
    "The log-prior probability encodes information about what you already believe about the system. `lmfit.emcee` assumes that this log-prior probability is zero if all the parameters are within their bounds and `-np.inf` if any of the parameters are outside their bounds. As such it's a uniform prior. \n",
    "\n",
    "The log-likelihood function is given below.\n",
    "\n",
    "To use non-uniform priors then should include these terms in `lnprob`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is the log-likelihood probability for the sampling.\n",
    "def lnprob(p):\n",
    "    resid = residual(p, just_generative=True)\n",
    "    return -0.5 * np.sum(((resid - y) / dy)**2 + np.log(2 * np.pi * dy**2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start with we have to create the minimizers and *burn* them in. We create 4 different minimizers representing 0, 1, 2 or 3 Gaussian contributions. To do the model selection we have to integrate the over the log-posterior distribution to see which has the higher probability. This is done using the `thermodynamic_integration_log_evidence` method of the `sampler` attribute contained in the `lmfit.Minimizer` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Work out the log-evidence for different numbers of peaks\n",
    "total_steps = 310\n",
    "burn = 300\n",
    "thin = 10\n",
    "ntemps = 15\n",
    "workers=4\n",
    "log_evidence = []\n",
    "res = []\n",
    "\n",
    "# set up the Minimizers\n",
    "for i in range(4):\n",
    "    p0 = initial_peak_params(i)\n",
    "    # you can't use lnprob as a userfcn with minimize because it needs to be maximised\n",
    "    mini = lmfit.Minimizer(residual, p0)\n",
    "    out = mini.minimize(method='differential_evolution')\n",
    "    res.append(out)\n",
    "\n",
    "mini = []\n",
    "# burn in the samplers\n",
    "for i in range(4):\n",
    "    # do the sampling\n",
    "    mini.append(lmfit.Minimizer(lnprob, res[i].params))\n",
    "    out = mini[i].emcee(steps=total_steps, ntemps=ntemps, workers=workers, reuse_sampler=False,\n",
    "                        float_behavior='posterior')\n",
    "    # get the evidence\n",
    "    print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())\n",
    "    log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we've burned in the samplers we have to do a collection run. We thin out the MCMC chain to reduce autocorrelation between successive samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for j in range(6):\n",
    "    total_steps += 100\n",
    "    for i in range(4):\n",
    "        # do the sampling\n",
    "        res = mini[i].emcee(burn=burn, steps=100, thin=thin, ntemps=ntemps,\n",
    "                         workers=workers, reuse_sampler=True)\n",
    "        # get the evidence\n",
    "        print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())\n",
    "        log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(log_evidence[-4:])\n",
    "plt.ylabel('Log-evidence')\n",
    "plt.xlabel('number of peaks')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Bayes factor is related to the exponential of the difference between the log-evidence values.  Thus, 0 peaks is not very likely compared to 1 peak. But 1 peak is not as good as 2 peaks. 3 peaks is not that much better than 2 peaks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r01 = np.exp(log_evidence[-4] - log_evidence[-3])\n",
    "r12 = np.exp(log_evidence[-3] - log_evidence[-2])\n",
    "r23 = np.exp(log_evidence[-2] - log_evidence[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(r01, r12, r23)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These numbers tell us that zero peaks is 0 times as likely as one peak. Two peaks is 7e49 times more likely than one peak. Three peaks is 1.1 times more likely than two peaks. With this data one would say that two peaks is sufficient. Caution has to be taken with these values. The log-priors for this sampling are uniform but improper, i.e. they are not normalised properly.  Internally the lnprior probability is calculated as 0 if all parameters are within their bounds and `-np.inf` if any parameter is outside the bounds. The `lnprob` function defined above is the log-likelihood alone. Remember, that the log-posterior probability is equal to the sum of the log-prior and log-likelihood probabilities. Extra terms can be added to the lnprob function to calculate the normalised log-probability.  These terms would look something like:\n",
    "\n",
    "$\\log \\left(\\prod_i \\frac{1}{\\max_i - \\min_i}\\right)$\n",
    "\n",
    "where $\\max_i$ and $\\min_i$ are the upper and lower bounds for the parameter, and the prior is a uniform distribution. Other types of prior are possible. For example, you might expect the prior to be Gaussian."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
